This dataset documents cancer death rates for every county in the
United States. It was found on Data World, where Noah Rippner uploaded
it as a challenge to predict the outcome variable, TARGET_deathRate
(mean per capita (100,000) cancer mortalities). His project page is
here: https://data.world/nrippner/ols-regression-challenge. In his
description, he states that the data were aggregated from the American
Community Survey (census.gov), clinicaltrials.gov, and cancer.gov. The
dataset was downloaded from Data World as a csv file and loaded into
this Rmd file as seen above.
| | |
|---|---|
| Name | cancer |
| Number of rows | 3047 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| binnedInc | 0 | 1 | 16 | 18 | 0 | 10 | 0 |
| Geography | 0 | 1 | 16 | 42 | 0 | 3047 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| avgAnnCount | 0 | 1.00 | 606.34 | 1416.36 | 6.00 | 76.00 | 171.00 | 518.00 | 38150.00 |
| avgDeathsPerYear | 0 | 1.00 | 185.97 | 504.13 | 3.00 | 28.00 | 61.00 | 149.00 | 14010.00 |
| TARGET_deathRate | 0 | 1.00 | 178.66 | 27.75 | 59.70 | 161.20 | 178.10 | 195.20 | 362.80 |
| incidenceRate | 0 | 1.00 | 448.27 | 54.56 | 201.30 | 420.30 | 453.55 | 480.85 | 1206.90 |
| medIncome | 0 | 1.00 | 47063.28 | 12040.09 | 22640.00 | 38882.50 | 45207.00 | 52492.00 | 125635.00 |
| popEst2015 | 0 | 1.00 | 102637.37 | 329059.22 | 827.00 | 11684.00 | 26643.00 | 68671.00 | 10170292.00 |
| povertyPercent | 0 | 1.00 | 16.88 | 6.41 | 3.20 | 12.15 | 15.90 | 20.40 | 47.40 |
| studyPerCap | 0 | 1.00 | 155.40 | 529.63 | 0.00 | 0.00 | 0.00 | 83.65 | 9762.31 |
| MedianAge | 0 | 1.00 | 45.27 | 45.30 | 22.30 | 37.70 | 41.00 | 44.00 | 624.00 |
| MedianAgeMale | 0 | 1.00 | 39.57 | 5.23 | 22.40 | 36.35 | 39.60 | 42.50 | 64.70 |
| MedianAgeFemale | 0 | 1.00 | 42.15 | 5.29 | 22.30 | 39.10 | 42.40 | 45.30 | 65.70 |
| AvgHouseholdSize | 0 | 1.00 | 2.48 | 0.43 | 0.02 | 2.37 | 2.50 | 2.63 | 3.97 |
| PercentMarried | 0 | 1.00 | 51.77 | 6.90 | 23.10 | 47.75 | 52.40 | 56.40 | 72.50 |
| PctNoHS18_24 | 0 | 1.00 | 18.22 | 8.09 | 0.00 | 12.80 | 17.10 | 22.70 | 64.10 |
| PctHS18_24 | 0 | 1.00 | 35.00 | 9.07 | 0.00 | 29.20 | 34.70 | 40.70 | 72.50 |
| PctSomeCol18_24 | 2285 | 0.25 | 40.98 | 11.12 | 7.10 | 34.00 | 40.40 | 46.40 | 79.00 |
| PctBachDeg18_24 | 0 | 1.00 | 6.16 | 4.53 | 0.00 | 3.10 | 5.40 | 8.20 | 51.80 |
| PctHS25_Over | 0 | 1.00 | 34.80 | 7.03 | 7.50 | 30.40 | 35.30 | 39.65 | 54.80 |
| PctBachDeg25_Over | 0 | 1.00 | 13.28 | 5.39 | 2.50 | 9.40 | 12.30 | 16.10 | 42.20 |
| PctEmployed16_Over | 152 | 0.95 | 54.15 | 8.32 | 17.60 | 48.60 | 54.50 | 60.30 | 80.10 |
| PctUnemployed16_Over | 0 | 1.00 | 7.85 | 3.45 | 0.40 | 5.50 | 7.60 | 9.70 | 29.40 |
| PctPrivateCoverage | 0 | 1.00 | 64.35 | 10.65 | 22.30 | 57.20 | 65.10 | 72.10 | 92.30 |
| PctPrivateCoverageAlone | 609 | 0.80 | 48.45 | 10.08 | 15.70 | 41.00 | 48.70 | 55.60 | 78.90 |
| PctEmpPrivCoverage | 0 | 1.00 | 41.20 | 9.45 | 13.50 | 34.50 | 41.10 | 47.70 | 70.70 |
| PctPublicCoverage | 0 | 1.00 | 36.25 | 7.84 | 11.20 | 30.90 | 36.30 | 41.55 | 65.10 |
| PctPublicCoverageAlone | 0 | 1.00 | 19.24 | 6.11 | 2.60 | 14.85 | 18.80 | 23.10 | 46.60 |
| PctWhite | 0 | 1.00 | 83.65 | 16.38 | 10.20 | 77.30 | 90.06 | 95.45 | 100.00 |
| PctBlack | 0 | 1.00 | 9.11 | 14.53 | 0.00 | 0.62 | 2.25 | 10.51 | 85.95 |
| PctAsian | 0 | 1.00 | 1.25 | 2.61 | 0.00 | 0.25 | 0.55 | 1.22 | 42.62 |
| PctOtherRace | 0 | 1.00 | 1.98 | 3.52 | 0.00 | 0.30 | 0.83 | 2.18 | 41.93 |
| PctMarriedHouseholds | 0 | 1.00 | 51.24 | 6.57 | 22.99 | 47.76 | 51.67 | 55.40 | 78.08 |
| BirthRate | 0 | 1.00 | 5.64 | 1.99 | 0.00 | 4.52 | 5.38 | 6.49 | 21.33 |
## # A tibble: 34 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 PctSomeCol18_24 2285 75.0
## 2 PctPrivateCoverageAlone 609 20.0
## 3 PctEmployed16_Over 152 4.99
## 4 avgAnnCount 0 0
## 5 avgDeathsPerYear 0 0
## 6 TARGET_deathRate 0 0
## 7 incidenceRate 0 0
## 8 medIncome 0 0
## 9 popEst2015 0 0
## 10 povertyPercent 0 0
## # … with 24 more rows
The data set contains 3047 observations of 33 feature variables and 1
target variable, TARGET_deathRate (mean per capita (100,000) cancer
mortalities). Only 3 variables have missing observations:
PctSomeCol18_24 (2285 missing), PctPrivateCoverageAlone (609 missing),
and PctEmployed16_Over (152 missing). PctSomeCol18_24 is missing about
75% of its observations and should be removed as a predictor.
PctEmployed16_Over is missing about 5% of its observations and
PctPrivateCoverageAlone about 20%, so imputation is a reasonable way to
handle their missingness.
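The percentages quoted above follow directly from the counts in the summary table. A minimal base R sketch of that arithmetic is shown here, with the counts copied from the table. The 50% cutoff used to separate "drop" from "impute" is an assumption made for illustration, not a rule taken from the analysis.

```r
# Missing-value counts copied from the summary table above.
n_rows <- 3047
n_miss <- c(PctSomeCol18_24 = 2285,
            PctPrivateCoverageAlone = 609,
            PctEmployed16_Over = 152)

# Percentage of observations missing for each variable.
pct_miss <- round(100 * n_miss / n_rows, 2)
print(pct_miss)

# Assumed cutoff (illustrative only): drop variables missing more than
# half of their values and treat the rest as candidates for imputation.
drop_threshold <- 50
to_drop   <- names(pct_miss)[pct_miss > drop_threshold]
to_impute <- names(pct_miss)[pct_miss <= drop_threshold]
```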
Most of the missing values for PctEmployed16_Over occur in counties
with a large value of PctWhite. There is a noticeable number of missing
values at middle values of PctBlack and a very small number at small
values of PctAsian.
PctPrivateCoverageAlone shows a similar pattern: most of its missing
values occur at large values of PctWhite, a noticeable number at middle
values of PctBlack, and a very small number at small values of
PctAsian.
Additionally, most of the missing values for these two variables
occur at lower values of medIncome.
The missingness is concentrated between MedianAge values of 30 and
50.
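One way to put a number on patterns like these is to compare a covariate between rows where a variable is missing and rows where it is observed. The sketch below uses a small made-up data frame (the `toy` values and the rule that creates the missing entries are assumptions, not the county data) and compares median medIncome across the two groups.

```r
# Made-up stand-in for the county data; not the real values.
n <- 200
toy <- data.frame(medIncome = seq(30000, 69800, by = 200))
toy$PctEmployed16_Over <- 40 + toy$medIncome / 2000

# Illustrative missingness rule: mostly low-income rows, plus a few others.
idx <- seq_len(n)
toy$PctEmployed16_Over[toy$medIncome < 40000 & idx %% 3 == 0] <- NA
toy$PctEmployed16_Over[idx %% 40 == 0] <- NA

# Missingness indicator, then a summary of the covariate in each group.
is_missing <- is.na(toy$PctEmployed16_Over)
income_by_missing <- tapply(toy$medIncome, is_missing, median)
print(income_by_missing)
```

A lower median in the `TRUE` group than in the `FALSE` group would indicate that missing values sit at lower incomes, which is the kind of pattern described above.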
It is essential to identify which variables are likely to matter most
for building an accurate and precise model of TARGET_deathRate. One
approach is simply to inspect the data and guess which variables have
the greatest association with the response variable. While we
hypothesized that factors like income, insurance, and race may be
heavily involved in predicting cancer mortalities, there are methods
that find patterns and associations between variables more reliably.
One of these is the correlation plot. Note that the variables with
missing data (PctPrivateCoverageAlone and PctEmployed16_Over) are not
represented in the correlation plot and are explored separately.
PctSomeCol18_24 is not explored in this EDA because of its extreme
missingness, and it will not be considered in our future recipe(s).
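The numbers behind a correlation plot can be sketched in base R as follows. The `toy` data frame is made up for illustration; only the column names come from the dataset. Columns with missing values are left out before calling `cor()`, mirroring the choice described above.

```r
# Made-up rows standing in for the county data.
toy <- data.frame(
  TARGET_deathRate   = c(150, 165, 172, 180, 191, 204),
  povertyPercent     = c(9, 12, 14, 17, 21, 25),
  medIncome          = c(68000, 59000, 52000, 47000, 40000, 33000),
  PctEmployed16_Over = c(62, NA, 57, 54, NA, 46)
)

# Keep only the columns that have no missing values.
complete_cols <- names(toy)[colSums(is.na(toy)) == 0]
cor_mat <- cor(toy[, complete_cols])

# Correlation of each remaining predictor with the target, sorted from
# most positive to most negative.
predictors <- setdiff(complete_cols, "TARGET_deathRate")
target_cor <- sort(cor_mat[predictors, "TARGET_deathRate"], decreasing = TRUE)
print(round(target_cor, 2))
```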
From the correlation plot above, we can identify variables whose
patterns align with the target variable TARGET_deathRate; the first
column shows this association. Here, the blue squares mark predictor
variables that are positively linearly correlated with the response
variable, while the red squares mark predictor variables that are
negatively linearly correlated with it. From this, we can see that
povertyPercent, PctHS25_Over, PctUnemployed16_Over, PctPublicCoverage,
and PctPublicCoverageAlone are most notably positively correlated with
TARGET_deathRate. By contrast, the predictor variables most notably
negatively correlated with the response variable are medIncome,
PctBachDeg25_Over, and PctPrivateCoverage.
These correlations make sense once you look past the numbers and
consider what the variables mean. The positively correlated predictors
generally correspond to characteristics that are tied to, or result
from, low income: poverty, a low level of education, unemployment, and
government-provided insurance. Another notable, though less strongly
positive, predictor is PctBlack; factors like systemic racism may
place Black residents in conditions that are not ideal or make it
harder for them to get tested and receive treatment. The negatively
correlated predictors follow the same logic: a higher income, better
education, and a more extensive private insurance plan may help
prevent as well as treat cancer, leading to fewer cancer
mortalities.
It is also interesting to use the correlation plot to identify relationships between predictor variables. While some relationships are more obvious, such as higher income being associated with a higher level of education and less reliance on government-provided healthcare, other patterns between variables may not be as well known.
From the data above, there is a clear contrast in the association
between marriage rates and race. The marriage rate is negatively
correlated with PctBlack but positively correlated with PctWhite.
Because each observation is a county rather than an individual, this
is an association between county-level percentages and may not hold
for individuals. Even so, it may be important to keep in mind when
developing models to predict the response variable.
The following are standard variable explorations for the domain area; they are unsurprising and conducted mainly out of convention. Their findings do not seem especially interesting or important, but they show some potential.
For povertyPercent, PctHS25_Over, PctUnemployed16_Over,
PctPublicCoverage, and PctPublicCoverageAlone, the boxplots and density
plots above show the distributions of the strongly positive predictor
variables. None of them needs to be transformed, for example with a
log transformation; they can be used as is for prediction.
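A numeric complement to reading boxplots and density plots is the sample skewness: values near zero suggest a roughly symmetric distribution, while large positive values suggest a long right tail where a log transformation may help. The sketch below is an assumed way to run such a check, not the procedure used in this analysis, and both vectors are made up.

```r
# Moment-based sample skewness; base R has no built-in function for it.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

symmetric    <- c(10, 12, 14, 16, 18, 20, 22)  # evenly spaced, made up
right_skewed <- c(1, 2, 2, 3, 4, 8, 60)        # long right tail, made up

skew_sym <- skewness(symmetric)          # symmetric values give zero
skew_raw <- skewness(right_skewed)       # clearly positive
skew_log <- skewness(log(right_skewed))  # reduced by the log transformation
print(c(symmetric = skew_sym, raw = skew_raw, logged = skew_log))
```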
The graphs above support the strong positive relationship between
TARGET_deathRate and each of povertyPercent, PctHS25_Over,
PctUnemployed16_Over, PctPublicCoverage, and PctPublicCoverageAlone
that was seen in the correlation plot.
These two plots represent the two most positive relationships from
the correlation plot. Both show naturally occurring relationships: a
larger population results in more cancer-related deaths, even at the
same rate as a smaller population. Similarly, a higher number of
reported cancer cases per year will naturally correlate with a higher
cancer-related mortality rate.
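The point about population size can be checked with simple arithmetic. The sketch below applies the same per-100,000 rate to two populations: the rate is the mean TARGET_deathRate and the small population is the median popEst2015, both from the summary table, while the large population is a made-up round number. It assumes the rate is an annual figure.

```r
# Same death rate per 100,000 residents applied to two populations.
rate_per_100k <- 178.66                  # mean TARGET_deathRate (summary table)
population <- c(small = 26643,           # median popEst2015 (summary table)
                large = 1000000)         # hypothetical large county

# Expected deaths scale with population even though the rate is identical.
expected_deaths <- rate_per_100k / 100000 * population
print(round(expected_deaths, 1))
```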
For medIncome, PctBachDeg25_Over, and PctPrivateCoverage, the boxplots
and density plots above show the distributions of the strongly
negative predictor variables. None of them needs to be transformed,
for example with a log transformation; they can be used as is for
prediction.
The graphs above support the strong negative relationship between
TARGET_deathRate and each of medIncome, PctBachDeg25_Over, and
PctPrivateCoverage that was seen in the correlation plot.
These two plots represent the two most negative relationships
from the correlation plot. They both show naturally occurring
relationships, as a high percent of private coverage means a low percent
of reliance on government assistance. Similarly, a high percent of
county population identifying as white means a low percent of county
population identifying as another race, such as black.
Since PctPrivateCoverageAlone is missing 20% of its data, we
separately explored variables that we thought would have a strong
naturally occurring relationship with it, such as PctPublicCoverage.
The plot above shows a strong negative relationship that could be used
for an imputation step in a future recipe.
Since PctEmployed16_Over is missing 5% of its data, we separately
explored variables that we thought would have a strong naturally
occurring relationship with it, such as PctUnemployed16_Over. The plot
above shows a strong negative relationship that could be used for an
imputation step in a future recipe.
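A strong linear relationship like this one can drive a regression-based imputation. The following is a minimal base R sketch of the idea, using made-up values and `lm()` in place of a recipe step; it is one possible approach under those assumptions, not the imputation that will necessarily be used.

```r
# Made-up employment and unemployment percentages, two values missing.
toy <- data.frame(
  PctUnemployed16_Over = c(3.5, 5.0, 6.2, 7.6, 9.1, 11.0, 13.4, 8.0),
  PctEmployed16_Over   = c(66.0, 61.5, NA, 54.0, 50.5, 46.0, NA, 53.0)
)

# Fit on the complete rows only (lm() drops rows with NA by default).
fit <- lm(PctEmployed16_Over ~ PctUnemployed16_Over, data = toy)

# Replace each missing value with the model's prediction for that row.
missing_rows <- is.na(toy$PctEmployed16_Over)
toy$PctEmployed16_Over[missing_rows] <-
  predict(fit, newdata = toy[missing_rows, , drop = FALSE])
print(toy)
```

Within a tidymodels recipe, `step_impute_linear()` from the recipes package is, to our knowledge, the step that applies the same idea.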
From this analysis, we gained a good understanding of the variables
whose missingness needs to be addressed in our recipe(s) and model
development (2285 missing for PctSomeCol18_24, 152 missing for
PctEmployed16_Over, and 609 missing for PctPrivateCoverageAlone). In
addition, we investigated relationships between the response and
predictor variables as well as among the predictor variables. This
allowed us to pinpoint variables of interest that will help during our
imputation step and that will be integral to our model for predicting
the response variable TARGET_deathRate.